Abstract

Background: Combining different sources of information to improve the available biological
knowledge is a current challenge in bioinformatics. Kernel-based methods are among the most
powerful methods for integrating heterogeneous data types. Kernel-based data integration
approaches consist of two basic steps: firstly, the right kernel is chosen for each data set;
secondly, the kernels from the different data sources are combined to give a complete
representation of the available data for a given statistical task.

Results: We analyze the integration of data from several sources of information using kernel PCA, from the point 
of view of reducing dimensionality. Moreover, we improve the interpretability of kernel PCA by adding to the plot 
the representation of the input variables that belong to any dataset. In particular, for each input variable or linear 
combination of input variables, we can represent the direction of maximum growth locally, which allows us to 
identify those samples with higher/lower values of the variables analyzed. 

Conclusions: The integration of different datasets and the simultaneous representation of samples and variables 
together give us a better understanding of biological knowledge. 



Background 

With the recent rapid advancements in high-throughput 
technologies, such as next generation sequencing, array 
comparative hybridization and mass spectrometry, data- 
bases are increasing in both the amount and the com- 
plexity of the data they contain. One of the main goals of 
mining this type of data is to visualize the relationships 
between biological variables that are involved [1]. For 
instance, visualizing gene expression guides the process 
of finding genes with similar expression patterns. How- 
ever, due to the number of genes involved, it is more 
effective to display the data by means of a low-dimen- 
sional plot. Here we focus on the problem of reducing 
dimensionality and the interpretability of the resulting 
data representations. 

Principal component analysis (PCA) has a very long 
history and is known to be a very powerful tool in the 
linear case. PCA is used as a visualization tool for the
analysis of microarray data [2] and [3]. However, the
sample space that many research problems deal with is 
considered nonlinear in nature; for example, the sample 
space of microarray data. One reason for this nonlinear- 
ity might be that the interactions of the genes are not 
completely understood. Many biological pathways are 
still not fully understood. So, it is quite naive to assume 
that genes are connected in a linear fashion. Following
this line of thought, research into nonlinear
dimensionality reduction for microarray gene expression
data has increased. Finding methods that can handle
such data is of great importance if we are to glean as 
much information as possible from them. 

Kernel representation offers an alternative to nonlinear 
functions by projecting the data into a high-dimensional 
feature space, which increases the computational power 
of linear learning machines [4] and [5]. Kernel methods 
enable us to construct different nonlinear versions of any 
algorithm which can be expressed solely in terms of dot 
products; this is known as the kernel trick. Kernel 
machines can be used to implement several learning
algorithms but the interpretability of the resultant output
representations may be cumbersome, because input variables
are only handled implicitly [6].

Nowadays, combining multiple sources of data to 
improve the biological knowledge available is a challen- 
ging task in bioinformatics. Data analysis of different 
sources of information is not simply a matter of adding 
the analysis of each separate dataset; instead it consists of 
the simultaneous analysis of multiple variables in the different
datasets [7].

Some of the most powerful methods for integrating 
heterogeneous data types are kernel-based methods [8] 
and [9]. We can describe kernel-based data integration 
approaches as using two basic steps. Firstly, the right ker- 
nel is chosen for each data set. Secondly, the kernels 
from the different data sources are combined to give a 
complete representation of the available data for a given 
statistical task. Basic mathematical operations such as 
multiplication, addition, and exponentiation preserve 
properties of kernel matrices and hence produce valid 
kernels. The simplest approach is to use positive linear 
combinations of the different kernels. 
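
As a minimal illustration of the last point, the following Python/NumPy sketch combines two Gram matrices with non-negative weights and checks numerically that the result is still symmetric and positive semidefinite. It is not the authors' code: the helper names (`rbf_gram`, `linear_gram`, `combine_kernels`), the random placeholder data, the weights and the kernel parameter are all assumptions made for illustration.

```python
import numpy as np

def rbf_gram(X, c):
    """Gaussian RBF Gram matrix K_ij = exp(-c * ||x_i - x_j||^2)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-c * np.maximum(d2, 0.0))

def linear_gram(X):
    """Linear-kernel Gram matrix K_ij = <x_i, x_j>."""
    return X @ X.T

def combine_kernels(grams, weights):
    """Positive linear combination of Gram matrices computed on the same samples."""
    weights = np.asarray(weights, dtype=float)
    if np.any(weights < 0):
        raise ValueError("weights must be non-negative")
    return sum(w * K for w, K in zip(weights, grams))

# Placeholder data: two hypothetical data sources measured on the same 10 samples.
rng = np.random.default_rng(0)
source_a = rng.normal(size=(10, 5))
source_b = rng.normal(size=(10, 3))

K = combine_kernels([rbf_gram(source_a, 0.1), linear_gram(source_b)], [0.7, 0.3])
min_eig = np.linalg.eigvalsh(K).min()  # smallest eigenvalue of the combined matrix
```

The only requirement is that every Gram matrix is computed on the same samples in the same order; the weights control the relative influence of each source.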

In this work, we analyze the integration of data from 
several sources of information using kernel PCA, from the 
point of view of reducing dimensionality and extending 
previous results [10]. Moreover, we improve kernel PCA 
interpretability by adding to the plot the representation of 
the input variables that belong to any dataset. In particu- 
lar, for each input variable or linear combination of input 
variables, we can represent the direction of maximum 
growth locally, which allows us to identify those samples 
with higher/lower values of the variables analyzed. There- 
fore the integration of different datasets and the simulta- 
neous representation of samples and variables together 
give us a better understanding of biological knowledge. 
This paper starts by briefly reviewing the notion of kernel 
PCA (Section 2). Section 3 contains our main results: a set 
of procedures to enhance the interpretability of kernel 
PCA when multiple datasets are analyzed simultaneously. 
We then present our results and apply them in parallel to 
analyze a nutrigenomic study in mouse [11]. 

Results and discussion 

Kernel methods enable us to construct different nonlinear
versions of any algorithm which can be expressed
solely in terms of dot products; kernel PCA is one such
case. Kernel PCA can be used to reduce dimensionality,
thereby improving on linear PCA, but the interpretability
of the output representations may be cumbersome
because the input variables are only handled implicitly.

In this section, we propose a set of procedures to 
improve the interpretability of kernel PCA. The proce- 
dures are related to the following aspects: 



♦ Representation of input variables. 

♦ Data integration and representation of input 
variables. 

♦ Representation of linear combinations of input 
variables. 

♦ Revealing the interpretability of input variables. 

To illustrate these procedures we use an example 
from metabolomics and genomics. The datasets come 
from a nutrigenomic study in mouse [11]. Forty mice 
were studied and two sets of variables were acquired: 
expressions of 120 genes measured in liver cells; and 
concentrations (in percentages) of 21 hepatic fatty acids 
(FAs) measured by gas chromatography. Biological units 
(mice) are cross-classified according to two factors: genotype,
which can be wild-type (WT) or PPARα-deficient
mice (PPAR); and diet, with 5 classes of diet in accordance
with the FA composition.

The oils used for the experimental diet preparation
were: corn and rapeseed oils (50:50), as the reference diet
(ref); hydrogenated coconut oil, as a saturated FA diet
(coc); sunflower oil, as an ω6 FA-rich diet (sun); linseed
oil, as an ω3 FA-rich diet (lin); and corn, rapeseed and
fish oils (42.5:42.5:15), as the fish diet. In the study, it
cannot be assumed that variations in one set of variables 
cause variations in the other; we do not know a priori if 
changes in gene expression imply changes in FA concen- 
trations or vice versa. Indeed, the nuclear receptor 
PPARα, which acts as a ligand-induced transcriptional
regulator, is known to be activated by various FAs and to 
regulate the expression of several genes involved in FA 
metabolism. It should be noted that the main observa- 
tions discussed in [11], which were extracted separately 
from the two datasets by both classical multidimensional 
tools (hierarchical clustering and PCA) and standard test 
procedures, are also highlighted by kernel PCA graphical 
representations. 

Representation of input variables 

In order to achieve interpretability we add supplementary
information into kernel PCA representations. We
have developed a procedure to represent any given
input variable on the subspace spanned by the eigenvectors
of $\tilde{C}$, the covariance matrix of the centered
feature-space points (see Methods).

We can consider that our observations are realizations
of the random vector $X = (X_1, \ldots, X_n)$. Then, to represent
the prominence of the input variable $X_k$ in kernel PCA,
we take a set of points of the form $y = a + s e_k \in \mathbb{R}^n$,
where $e_k = (0, \ldots, 1, \ldots, 0) \in \mathbb{R}^n$, $s \in \mathbb{R}$, and the $k$-th
component is equal to 1 and the others are 0. Then, we
can compute the projections of the image of these
points, $\phi(y)$, onto the subspace spanned by the eigenvectors
of $\tilde{C}$. Taking into account equation (8), the



Reverter et al. BMC Systems Biology 2014, 8(Suppl 2):S6
http://www.biomedcentral.com/1752-0509/8/S2/S6



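induced curve expressed in matrix form is given by the
row vector:

$$\sigma(s)_{1 \times r} = \left( Z_s^T - \frac{1}{m} \mathbf{1}_m^T K \right) \left( I_m - \frac{1}{m} \mathbf{1}_m \mathbf{1}_m^T \right) \tilde{V},$$

where $Z_s$ is in the form of (7).

In addition, we can represent directions of maximum
growth of $\sigma(s)$ with respect to the variable $X_k$ by projecting
the tangent vector at $s = 0$. In matrix form, we have:

$$\frac{d\sigma}{ds}\bigg|_{s=0} = \frac{dZ_s^T}{ds}\bigg|_{s=0} \left( I_m - \frac{1}{m} \mathbf{1}_m \mathbf{1}_m^T \right) \tilde{V}, \qquad (1)$$

where the $i$-th component of $Z_s$ is $Z_s^i = K(y, x_i)$ and, using the chain rule:

$$\frac{dZ_s^i}{ds}\bigg|_{s=0} = \frac{dK(y, x_i)}{ds}\bigg|_{s=0} = \sum_{t=1}^{n} \frac{\partial K(y, x_i)}{\partial y_t} \frac{dy_t}{ds}\bigg|_{s=0} = \frac{\partial K(y, x_i)}{\partial y_k}\bigg|_{y=a}. \qquad (2)$$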



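In particular, let us consider the Gaussian radial basis
function kernel: $k(x, z) = \exp(-c\,\|x - z\|^2)$, with $c > 0$ a
free parameter. Using the notation above, we have:

$$K(y, x_i) = \exp(-c\,\|y - x_i\|^2) = \exp\left( -c \sum_{t=1}^{n} (y_t - x_{it})^2 \right).$$

For the set of points of the form $y = a + s e_k \in \mathbb{R}^n$, and with
$Z_s^i = K(y, x_i)$:

$$\frac{dZ_s^i}{ds}\bigg|_{s=0} = \frac{\partial K(y, x_i)}{\partial y_k}\bigg|_{y=a} = -2c\,K(a, x_i)\,(a_k - x_{ik}).$$

In addition, if $a = x_\beta$ (a training point) then:

$$\frac{dZ_s^i}{ds}\bigg|_{s=0} = -2c\,K(x_\beta, x_i)\,(x_{\beta k} - x_{ik}).$$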
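
The tangent computation above can be checked numerically. The following Python/NumPy sketch is not the authors' code: the helper names (`rbf`, `fit_kpca`, `project`, `tangent`), the random placeholder data and the value of c are assumptions made for illustration. It evaluates the Gaussian-kernel derivative at a training point, maps it to the kernel PCA plane as in formula (1), and compares the result with a central finite difference of the projected curve.

```python
import numpy as np

def rbf(A, B, c):
    """Gaussian RBF kernel matrix between the rows of A and the rows of B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-c * d2)

def fit_kpca(X, c, r=2):
    """Return K, the centering matrix H and the normalized dual coefficients."""
    m = X.shape[0]
    K = rbf(X, X, c)
    H = np.eye(m) - np.ones((m, m)) / m
    evals, evecs = np.linalg.eigh(H @ K @ H)
    idx = np.argsort(evals)[::-1][:r]
    # Rescale so that each eigenvector has unit norm in feature space.
    V = evecs[:, idx] / np.sqrt(evals[idx])
    return K, H, V

def project(Y, X, K, H, V, c):
    """Kernel PCA coordinates of the rows of Y (matrix form of equation (8))."""
    Z = rbf(Y, X, c)                      # row i holds Z^T for the point y_i
    return (Z - K.mean(axis=0)) @ H @ V   # K.mean(axis=0) equals (1/m) 1^T K

def tangent(a, k, X, H, V, c):
    """Projected direction of maximum growth of variable k at the point a."""
    dZ = -2.0 * c * rbf(a[None, :], X, c)[0] * (a[k] - X[:, k])
    return dZ @ H @ V

# Placeholder data (assumption): 12 samples, 4 variables.
rng = np.random.default_rng(1)
X = rng.normal(size=(12, 4))
c, k = 0.3, 2
K, H, V = fit_kpca(X, c)
a = X[0]                                  # a training point
analytic = tangent(a, k, X, H, V, c)

# Central finite difference of the curve sigma(s) at s = 0.
h = 1e-5
e_k = np.eye(X.shape[1])[k]
numeric = (project((a + h * e_k)[None, :], X, K, H, V, c)
           - project((a - h * e_k)[None, :], X, K, H, V, c))[0] / (2 * h)
```

The constant term in the curve drops out when differentiating, which is why only the derivative of Z enters `tangent`.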



To illustrate our procedure we introduce a toy example.
We have generated a dataset of 18 points in
6-dimensional space. The coordinates of the points are
chosen so as to distinguish 3 clearly separated groups.
Group 1 has 6 points such that the sum of the $X_1$
and $X_2$ coordinates is equal to 15 for each point. Moreover,
in this group, there are 3 points such that the sum
of $X_3$, $X_4$ and $X_5$ is 0, and 3 points for which this sum
is equal to 6. Group 2 has 6 points such that the
sum of the $X_3$, $X_4$ and $X_5$ coordinates is equal to 0 for each
point. In addition, in this group, there are 3 points such that
the sum of $X_1$ and $X_2$ is 0, and 3 points for which this sum
is equal to -4. Finally, group 3 has 6 points such
that the sum of the $X_1$ and $X_2$ coordinates is equal to 0 for
each point. Moreover, in this group, there are 3 points
such that the sum of $X_3$, $X_4$ and $X_5$ is 15, and 3 points
for which this sum is equal to 24. All coordinates were
perturbed with weak Gaussian noise in order to introduce
a small amount of variability inside each group. In each
group the variable $X_6$ is assigned randomly according to
a Gaussian of mean zero and standard deviation 0.5. The
configuration of the points is such that we expect that,
when reducing dimension, only the first dimensions are
necessary to reveal the arrangement of the three groups.
This can be seen in Figure 1, where the two leading components
of kernel PCA are represented. We can see
group 1 (represented by upward triangles and circles) on the
negative part of the first principal axis, group 2 (represented
by plus signs and crosses) in the central part and
group 3 (represented by diamonds and downward
triangles) on the positive part.
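
A dataset with the structure described above can be sketched as follows. This is only one possible construction, not the authors' generator: the function name `make_toy`, the even split of each target sum across coordinates, the noise level and the seed are assumptions made for illustration; only the group sums and the distribution of $X_6$ come from the description.

```python
import numpy as np

def make_toy(noise_sd=0.05, seed=0):
    """Generate 18 points in 6 dimensions arranged in 3 groups of 6 points."""
    rng = np.random.default_rng(seed)
    # (X1 + X2, X3 + X4 + X5) target sums for the six subgroups of 3 points each.
    targets = [(15, 0), (15, 6), (0, 0), (-4, 0), (0, 15), (0, 24)]
    rows, labels = [], []
    for sub, (s12, s345) in enumerate(targets):
        for _ in range(3):
            # Assumption: each target sum is split evenly across its coordinates.
            x = np.array([s12 / 2, s12 / 2, s345 / 3, s345 / 3, s345 / 3, 0.0])
            x[:5] += rng.normal(scale=noise_sd, size=5)  # weak Gaussian noise
            x[5] = rng.normal(scale=0.5)                 # X6: mean 0, sd 0.5
            rows.append(x)
            labels.append(sub // 2 + 1)                  # groups 1, 2, 3
    return np.array(rows), np.array(labels)

X, groups = make_toy()
```

Setting `noise_sd=0.0` reproduces the target sums exactly, which is a convenient way to check the construction.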

Figure 1 shows the samples and the variables from $X_1$ to
$X_5$ at each sample. Variables are represented by vectors
that indicate the direction of maximum growth in each
variable. In fact, we can see that the vectors point to
those groups characterized by higher values in each variable.
For instance, the variables $X_1$ and $X_2$ point to
group 1, and the variables $X_3$, $X_4$ and $X_5$ point to
group 3.

Figure 2 shows the variable $X_6$ at each sample. We can
observe that this variable is poorly represented and has
no preferred direction towards any group.

A natural extension of the above procedure is the
representation of linear combinations of input variables.
Details can be found in the section "Representation of
linear combinations of input variables". To show this
property, Figure 3 displays the samples
and the linear combinations $X_1 + X_2$ and $X_3 + X_4 + X_5$
at each sample. Linear combinations are represented by
vectors that point in the direction of maximum growth
of each linear combination. We can observe that
at each sample the vectors point to those groups with
higher values of each linear combination. For example,
vectors representing $X_1 + X_2$ point to group 1, and
vectors representing $X_3 + X_4 + X_5$ point to group 3.

Figure 1 Kernel PCA analyzing the toy example. Variables are
represented by vectors that indicate the direction of maximum
growth in each variable.

Figure 2 Kernel PCA analyzing the toy example. Variable $X_6$ is
poorly represented and the direction of maximum growth of this
variable shows no trend to any group.

Analyzing the nutrigenomic dataset 

We illustrate the representation of variables by analyzing 
the dataset in [11]. We apply kernel PCA and represen- 
tation of variables to the genomic data and FA data. 



Firstly, we compute kernel PCA by analyzing only gene
expression level data. Figure 4 shows the two leading
axes of kernel PCA. We can observe that the genotypes
are clearly separated (WT samples are represented in
black and PPAR samples in red). Diets are represented as follows:
the ref diet by the letter x; the coc diet by circles;
the sun diet by diamonds; the lin diet by plus signs; and
the fish diet by triangles. Figure 4 shows the AOX (blue
vector) and CAR1 (green vector) genes. Vectors indicate
the direction of maximum growth of the gene expression
at each sample point. Thus, we can observe that
AOX increases towards WT and CAR1 towards PPAR.
These results are in agreement with those found in [11]
and [12]. Figure 5 and Figure 6 show the profiles of the
medians of the expression of AOX and CAR1 grouped
by genotype. We can observe that these profiles agree
with the kernel PCA representation.

Secondly, to compare results, we compute kernel PCA
analyzing only FA levels. In Figure 7 we can observe
that the sample points are separated by genotype, but
we can also observe that the samples with the coc diet (a
diet with hydrogenated coconut oil as a saturated FA
diet) form a cluster. Figure 7 shows the C20.2ω.6 (green
vector) and C16.0 (blue vector) FAs. It reveals higher
levels of C20.2ω.6 towards the PPARα-deficient cluster of
samples (red) and that levels of C16.0 are higher
towards the WT cluster of samples (black).

These results are also in agreement with those found
in [11] and [12]. Figure 8 and Figure 9 show the profiles
of the medians of the concentrations of the C16.0 and
C20.2ω.6 FAs, grouped by genotype. We can observe that
these profiles agree with the kernel PCA representation.

Figure 3 Kernel PCA analyzing the toy example. Linear
combinations of variables are represented by vectors that indicate
the direction of maximum growth in each of the linear
combinations.

Figure 4 Kernel PCA of gene expression. The genes AOX (blue
vector) and CAR1 (green vector) are represented at each sample
point. WT samples are represented in black and PPAR samples in
red. Diet representation is: (ref) diet by the letter x; (coc) diet by
circles; (sun) diet by diamonds; (lin) diet by plus signs; and (fish) diet
by triangles.

Figure 5 AOX gene profile. Profile of the median gene expression
of the AOX gene.



Data integration and representation of input variables 

The kernel formalism allows us to combine heterogeneous
datasets for data fusion. Basic algebraic operations
such as addition, multiplication and exponentiation preserve
the key properties of symmetry and positive
semi-definiteness, and thus allow a simple but powerful
algebra of kernels. If $k_1$ and $k_2$ are kernels defined
respectively on $\mathcal{X}_1 \times \mathcal{X}_1$ and $\mathcal{X}_2 \times \mathcal{X}_2$, then their direct
sum:

$$(k_1 \oplus k_2)(x_1, x_2, x_1', x_2') = k_1(x_1, x_1') + k_2(x_2, x_2')$$

is a kernel on $(\mathcal{X}_1 \times \mathcal{X}_2) \times (\mathcal{X}_1 \times \mathcal{X}_2)$. Here,
$x_1, x_1' \in \mathcal{X}_1$ and $x_2, x_2' \in \mathcal{X}_2$.

This construction can be useful if the different parts
of the input have different meanings and should therefore
be dealt with differently. In that case, we can split
the inputs into two parts, $x_1$ and $x_2$, and use two different
kernels for these parts. This is the case when we are
integrating two separate datasets. In consequence, our
procedure can easily be extended to data fusion. Firstly,
we reduce the dimension of the entire data $(x_{1i}, x_{2i})$, $i = 1, \ldots, m$,
by applying kernel PCA with the kernel $K$ given
by $k_1 \oplus k_2$. Secondly, to find the coordinates of a test
point:

$$y = (y_1, y_2),$$

we proceed by analogy with (8), so that (7) becomes:

$$Z = \big( K(y_1, x_{1i};\, y_2, x_{2i}) \big)_{m \times 1} = \big( K_1(y_1, x_{1i}) + K_2(y_2, x_{2i}) \big)_{m \times 1}.$$

When we integrate two datasets, we can represent any
given input variable that belongs to one of the datasets.
Let us suppose that we wish to represent the variable $X_k^j$
that belongs to the dataset $j = 1, 2$. Then (2) becomes:

$$\frac{dZ_s^i}{ds}\bigg|_{s=0} = \frac{\partial K_j(y_j, x_{ji})}{\partial y_{jk}}\bigg|_{y=a}.$$

Figure 6 CAR1 gene profile. Profile of the median
expression of the CAR1 gene.

Figure 7 Kernel PCA of fatty acid concentrations. The fatty acids
C16.0 (blue vector) and C20.2ω.6 (green vector) are represented at each
sample point. WT samples are represented in black and PPAR samples in
red. Diet representation is: (ref) diet by the letter x; (coc) diet by circles;
(sun) diet by diamonds; (lin) diet by plus signs; and (fish) diet by triangles.

Figure 8 C16.0 fatty acid profile. Profile of the median
concentrations of the C16.0 fatty acid.

Figure 9 C20.2ω.6 fatty acid profile. Profile of the median
concentrations of the C20.2ω.6 fatty acid.



Then, formula (1) allows us to display variables that 
belong to any of the datasets over the kernel PCA repre- 
sentation of samples, simultaneously. 
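
The integration step can be sketched as follows, assuming a Gaussian kernel on each dataset. This is an illustrative Python/NumPy sketch, not the authors' code: the function name `integrated_tangents`, the kernel parameters and the random placeholder matrices (standing in for two datasets measured on the same samples) are assumptions. Because the combined kernel is a sum, the derivative with respect to a variable of dataset j involves only the kernel of that dataset.

```python
import numpy as np

def rbf(A, B, c):
    """Gaussian RBF kernel matrix between the rows of A and the rows of B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-c * d2)

def integrated_tangents(X1, X2, c1, c2, dataset, k, r=2):
    """Kernel PCA scores on K1 + K2 and arrows for variable k of `dataset` (1 or 2)."""
    m = X1.shape[0]
    K1, K2 = rbf(X1, X1, c1), rbf(X2, X2, c2)
    K = K1 + K2                                   # direct-sum kernel
    H = np.eye(m) - np.ones((m, m)) / m
    Kc = H @ K @ H                                # centered Gram matrix
    evals, evecs = np.linalg.eigh(Kc)
    idx = np.argsort(evals)[::-1][:r]
    V = evecs[:, idx] / np.sqrt(evals[idx])       # normalized dual coefficients
    scores = Kc @ V                               # coordinates of training points
    Xj, Kj, cj = (X1, K1, c1) if dataset == 1 else (X2, K2, c2)
    # dZ[b, i] = -2 c_j K_j(x_b, x_i) (x_bk - x_ik); the other kernel is constant in X_k.
    diff = Xj[:, k][:, None] - Xj[:, k][None, :]
    dZ = -2.0 * cj * Kj * diff
    return scores, dZ @ H @ V

# Placeholder data (assumption): 15 samples, two blocks of variables.
rng = np.random.default_rng(4)
block_1 = rng.normal(size=(15, 6))
block_2 = rng.normal(size=(15, 4))
scores, arrows = integrated_tangents(block_1, block_2, 0.1, 0.2, dataset=2, k=1)
```

Row b of `arrows` is the vector that would be drawn at row b of `scores` to indicate the direction of maximum growth of the chosen variable.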
Analyzing the nutrigenomic dataset 

Continuing with the same nutrigenomic study, we compute
kernel PCA by analyzing both datasets simultaneously;
that is, gene expressions and FA concentrations.
We observe that the genotypes are clearly separated (WT
is represented in black and PPAR in red) and that mice
with the coc diet form a cluster containing both genotypes; see
Figure 10. Figure 10 also shows the AOX (black vector) and
CAR1 (green vector) genes, and the C20.2ω.6 (blue vector)
and C16.0 (red vector) FAs. It reveals higher expression of
CAR1 and higher concentrations of C20.2ω.6 towards the
PPAR cluster. In contrast, AOX gene expression and
concentrations of C16.0 are higher towards the WT cluster.
These results are in agreement with those found in the
individual kernel PCAs above.

Representation of linear combinations of input variables 

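A natural extension of the above procedure is the
representation of linear combinations of input variables.
This may be useful for representing gene modules
or gene networks. Let us suppose that we wish to
represent the linear combination $X_{k_1} + X_{k_2} + \cdots + X_{k_l}$,
where $k_1, k_2, \ldots, k_l \in \{1, 2, \ldots, n\}$, with $k_i \neq k_j$ for $i \neq j$, $i, j = 1, \ldots, l$.
Then, when $K$ is the Gaussian radial basis function
kernel, (2) becomes:

$$\frac{dZ_s^i}{ds}\bigg|_{s=0} = \sum_{t=1}^{l} \frac{\partial K(y, x_i)}{\partial y_{k_t}}\bigg|_{y=a} = -2c\,K(a, x_i) \sum_{t=1}^{l} (a_{k_t} - x_{i k_t}).$$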



Then, formula (1) allows us to represent any linear 
combination of input variables. 
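
The derivative for a sum of variables can be verified numerically. The sketch below is illustrative only: the function names `rbf_row` and `dZ_sum`, the random placeholder data, the chosen indices and the value of c are assumptions. It compares the closed-form Gaussian-kernel derivative along the direction $e_{k_1} + \cdots + e_{k_l}$ with a central finite difference.

```python
import numpy as np

def rbf_row(y, X, c):
    """Vector Z with entries K(y, x_i) for the Gaussian RBF kernel."""
    return np.exp(-c * ((X - y) ** 2).sum(axis=1))

def dZ_sum(a, X, c, ks):
    """Derivative of Z along e_{k_1} + ... + e_{k_l}, evaluated at y = a."""
    ks = list(ks)
    return -2.0 * c * rbf_row(a, X, c) * (a[ks] - X[:, ks]).sum(axis=1)

# Placeholder data (assumption): 8 samples, 5 variables.
rng = np.random.default_rng(2)
X = rng.normal(size=(8, 5))
a, c, ks = X[3], 0.4, [0, 2, 4]

# Direction vector with ones at the selected variable indices.
direction = np.zeros(X.shape[1])
direction[ks] = 1.0

h = 1e-6
numeric = (rbf_row(a + h * direction, X, c)
           - rbf_row(a - h * direction, X, c)) / (2 * h)
analytic = dZ_sum(a, X, c, ks)
```

To obtain the arrow drawn on the kernel PCA plane, the resulting vector is multiplied by the centering matrix and the matrix of normalized coefficients, exactly as for a single variable.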
Analyzing the nutrigenomic dataset 

To illustrate this procedure we have analyzed the genes
GSTpi2, CYP3A11 and CYP2c29. These genes are
involved in detoxification [12]. We
perform kernel PCA analyzing both datasets simultaneously
and represent the sum of the expressions of the
genes GSTpi2, CYP3A11 and CYP2c29. Figure 11 shows
the sample points, with the vector corresponding to the sum
of the three gene expressions attached to each point.
The vector indicates the direction of maximum growth
of the sum of the expressions. We observe that the sum
of the expressions increases towards the fish diet. This
is in agreement with the findings in [12].

Figure 10 Kernel PCA analyzing gene expression and fatty acid
concentrations simultaneously. The genes AOX (black vector) and
CAR1 (green vector) and fatty acids C20.2ω.6 (blue vector) and
C16.0 (red vector) are represented at each sample point. The WT
samples are represented in black and the PPAR samples in red. Diet
representation is: (ref) diet by the letter x; (coc) diet by circles; (sun)
diet by diamonds; (lin) diet by plus signs; and (fish) diet by triangles.

Revealing the interpretability of input variables 

Our procedure for representing input variables on the
two-dimensional subspace spanned by the two leading
eigenvectors of $\tilde{C}$ displays the variables as vectors
whose direction is the direction of maximum growth of
the variable at a given point; in particular, at the sample
points.

So, if we set a direction in this plane, given by a vector
$w$, we can search for input variables whose representation
on the kernel PCA plane is correlated with this
direction. Let us suppose that we observe clusters of
samples in the kernel PCA representation; then an interesting
direction can be given by the vector defined by
any two cluster centroids.

Once we have selected a vector $w$, we denote by $w_i$
the vector parallel to $w$ attached to the image given by
kernel PCA of the sample point $x_i$, $i = 1, \ldots, m$. For
any variable $X_k$, we now compute its vector representation
in kernel PCA using formula (1); we denote this
vector as $\frac{d\sigma_i^k}{ds}\big|_{s=0}$. Therefore, for each sample point
$x_i$, $i = 1, \ldots, m$, we have two vectors, one corresponding
to the direction $w_i$, and the other corresponding to the $X_k$
representation, $\frac{d\sigma_i^k}{ds}\big|_{s=0}$. After this, to measure the
strength of the correlation between $X_k$ and $w$, we average
the cosine of the angles between each pair of vectors,
that is:

$$R_k = \frac{1}{m} \sum_{i=1}^{m} \cos\left( \frac{d\sigma_i^k}{ds}\bigg|_{s=0},\; w_i \right).$$

Figure 11 Representation of linear combinations of input
variables. The sum of the expression of the genes GSTpi2,
CYP3A11 and CYP2c29 is represented. These genes are associated
with detoxification. Wild type samples are represented in black and
PPAR samples in red. Diet representation is: (ref) diet by the letter x;
(coc) diet by circles; (sun) diet by diamonds; (lin) diet by plus signs;
and (fish) diet by triangles.


Finally, we order all the variables according to R/^ and 
we can select those with higher values and also those 
with lower values. Thus, in this way, for each sample 
cluster, we can find the correlated variables with higher 
and lower values. Knowledge of such variables can 
improve the biological interpretability of the results. 
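
A minimal sketch of this ranking measure is given below. It is illustrative only: the function names `direction_correlation` and `centroid_direction` are hypothetical, the input arrays are placeholders, and reporting the standard deviation of the cosines alongside their mean is our assumption about how a per-variable spread could be computed. The sketch also assumes that no arrow has zero length.

```python
import numpy as np

def direction_correlation(arrows, w):
    """Mean and sd of the cosine between each sample's arrow and the direction w."""
    w = np.asarray(w, dtype=float)
    norms = np.linalg.norm(arrows, axis=1) * np.linalg.norm(w)
    cosines = (arrows @ w) / norms          # one cosine per sample point
    return cosines.mean(), cosines.std(ddof=1)

def centroid_direction(scores, labels, from_label, to_label):
    """Vector from the centroid of one cluster to the centroid of another."""
    start = scores[labels == from_label].mean(axis=0)
    end = scores[labels == to_label].mean(axis=0)
    return end - start

# Placeholder kernel PCA coordinates and cluster labels (assumptions).
scores = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 4.0], [12.0, 4.0]])
labels = np.array([0, 0, 1, 1])
w = centroid_direction(scores, labels, 0, 1)

# Placeholder arrows for one variable, one row per sample point.
arrows = np.array([[1.0, 0.4], [2.0, 0.8], [0.5, 0.2], [3.0, 1.2]])
R_k, R_sd = direction_correlation(arrows, w)
```

Applying `direction_correlation` to the arrows of every variable and sorting by the mean gives the ordering described above; values near 1 indicate growth along w and values near -1 growth in the opposite direction.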

A natural extension of this procedure is to take as $w$
the vector corresponding to one of the input variables.
Then, if we know that a certain input variable is useful
for interpreting the kernel PCA representation, we can
search for other input variables whose representation on
the kernel PCA plane is correlated with this feature. If
we are integrating multiple datasets, we can search for
correlated variables in each dataset.
Analyzing the nutrigenomic dataset 

To illustrate this procedure, we have selected a preferred
direction in the kernel PCA plane. Figure 12 shows this
direction (green vector). This direction represents variables
that are less expressed in samples with the coc diet
than in those with other diets. Tables 1 and 2 summarize
the genes and FAs that are most correlated with the
selected direction.

In Table 1, we can observe that FAs with negative correlation,
such as C16.1ω.7, C20.3ω.9 and C18.1ω.7, represent
FAs with higher concentrations in samples with the
coc diet. In contrast, FAs that are positively correlated,
such as C22.4ω.6, C18.2ω.6, C18.3ω.3 and C22.5ω.6,
represent FAs with higher concentrations in samples with
other types of diet. Furthermore, in Table 2, we can
observe that genes with negative correlation at the top of
the table, such as S14, ACC2 and LPL, are more highly
expressed in samples with the coc diet, whereas genes at
the bottom of the table, which are positively correlated, are
less expressed in the coc diet samples. These results are
in agreement with those found in [12].

Figure 12 Kernel PCA analyzing gene expression and fatty acid
concentrations simultaneously. The green vector represents
variables that are expressed less in samples with the coc diet. It is
defined by two cluster centroids: the left-hand cluster is the coc
diet; and the right-hand cluster comprises the other diets.

Conclusions 

With the rapidly increasing amount of genomic, proteomic,
and other high-throughput data that is available,
the importance of data integration has increased significantly.
Biologists, medical scientists, and clinicians
are also interested in integrating the high-throughput
data that has recently become available with
previously existing clinical, laboratory and biological
information.

Kernel methods, in particular kernel PCA, constitute a
powerful methodology because they allow us to reduce
dimensionality and integrate multiple datasets simultaneously.
Moreover, in this paper we have introduced a
set of procedures to improve the interpretability of kernel
PCA representations. The procedures are related to
the following aspects: 1) representation of variables; 2)
representation of linear combinations of variables; 3)
data integration and representation of variables; and 4)
revealing the interpretability of input variables. Our procedure
is a kernel-based exploratory tool for data
mining that enables us to extract nonlinear features
while representing variables.

Table 1 Fatty acids: correlation with the preferred direction.

FA           mean      sd
C16.1ω.7    -0.927    0.100
C20.3ω.9    -0.917    0.336
C18.1ω.7    -0.907    0.270
C14.0       -0.898    0.131
C18.3ω.6    -0.862    0.372
C18.1ω.9    -0.695    0.132
C16.1ω.9    -0.480    0.224
C16.0       -0.295    0.265
C20.1ω.9     0.176    0.401
C22.5ω.3     0.198    0.346
C20.3ω.3     0.235    0.383
C20.5ω.3     0.300    0.219
C20.3ω.6     0.386    0.227
C18.0        0.392    0.171
C22.6ω.3     0.453    0.151
C20.2ω.6     0.601    0.306
C20.4ω.6     0.664    0.360
C22.4ω.6     0.684    0.367
C18.2ω.6     0.718    0.290
C18.3ω.3     0.727    0.482
C22.5ω.6     0.731    0.499

Fatty acids. Mean and standard deviation of the $R_k$ measure of the strength of
correlation with the preferred direction.

Table 2 Genes: correlation with the preferred direction.

gene         mean      sd
S14         -0.998    0.002
ACC2        -0.997    0.004
LPL         -0.997    0.005
ap2         -0.996    0.006
NGFiB       -0.996    0.005
i.FABP      -0.995    0.007
COX1        -0.993    0.012
CIDEA       -0.993    0.012
MDR1        -0.991    0.016
Lpin        -0.991    0.007
MTHFR       -0.991    0.012
Lpin1       -0.989    0.009
i.BAT       -0.988    0.014
PPARg       -0.986    0.025
ACAT2       -0.984    0.013
CYP2b10     -0.978    0.022
hABC1       -0.976    0.021
ACC1        -0.975    0.012
SPI1.1       0.353    0.042
GSTpi2       0.587    0.038

Gene codes. Mean and standard deviation of the $R_k$ measure of the strength
of correlation with the preferred direction.

Methods 

Given a sample space $\mathcal{X}$, a real-valued positive
definite kernel $k$ on $\mathcal{X}$ is a map $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$
such that

$$\sum_{i,j=1}^{m} a_i a_j k(x_i, x_j) \ge 0$$

for all $m \in \mathbb{N}$, $a_i \in \mathbb{R}$, $x_i \in \mathcal{X}$, $i = 1, \ldots, m$, and zero is
attained if all the coefficients $a_i$ are zero. A kernel can
be interpreted as a similarity measure of the samples
and allows us to identify each $x \in \mathcal{X}$ with a real function
given by

$$\phi : \mathcal{X} \to \mathbb{R}^{\mathcal{X}} = \{ f : \mathcal{X} \to \mathbb{R} \}, \qquad x \mapsto \phi(x) = k(\cdot, x),$$

which is an element of a dot product vector space that
will be called feature space [5]. It consists of all functions

$$f(\cdot) = \sum_{i=1}^{m} a_i k(\cdot, x_i)$$

for any $m \in \mathbb{N}$ and $x_1, \ldots, x_m \in \mathcal{X}$, $a_1, \ldots, a_m \in \mathbb{R}$. It
has the reproducing property

$$\langle k(\cdot, x), f \rangle = f(x),$$

implying $\langle \phi(x), \phi(y) \rangle = \langle k(\cdot, x), k(\cdot, y) \rangle = k(x, y)$. After
completion we can turn our feature space into a Hilbert
space [5]. The space $\mathcal{H}_k$ is the reproducing kernel Hilbert
space (RKHS) induced by the kernel function $k$.

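Given any $\phi$ and any set of observations $x_1, \ldots, x_m$, the
Gram or kernel matrix of $k$ with respect to these observations is the
$m \times m$ matrix $K$ with elements $K_{ij} = \langle \phi(x_i), \phi(x_j) \rangle = k(x_i, x_j)$.

Let us define

$$\bar{\phi} = \frac{1}{m} \sum_{i=1}^{m} \phi(x_i);$$

then, the points

$$\tilde{\phi}(x_i) = \phi(x_i) - \bar{\phi} \qquad (3)$$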

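will be centered. Let $\tilde{K}$ denote the kernel matrix of the
centered points, $\tilde{K}_{ij} = \langle \tilde{\phi}(x_i), \tilde{\phi}(x_j) \rangle$. Because we do not
have the centered data (3), we cannot compute $\tilde{K}$ explicitly;
however, we can express it in terms of its noncentered
counterpart $K$ [5]. Using the vector $\mathbf{1}_m = (1, \ldots, 1)^T$,
we get the more compact expression

$$\tilde{K} = K - \frac{1}{m} K \mathbf{1}_m \mathbf{1}_m^T - \frac{1}{m} \mathbf{1}_m \mathbf{1}_m^T K + \frac{1}{m^2} \left( \mathbf{1}_m^T K \mathbf{1}_m \right) \mathbf{1}_m \mathbf{1}_m^T.$$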
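
The centering expression can be checked on a case where the feature map is known explicitly. In the following illustrative sketch (the function name `center_gram` and the random placeholder data are assumptions), the kernel is the linear kernel, so the centered Gram matrix must coincide with the Gram matrix of the mean-centered data.

```python
import numpy as np

def center_gram(K):
    """Center a Gram matrix in feature space without computing the feature map."""
    m = K.shape[0]
    one = np.ones((m, m)) / m        # (1/m) * 1_m 1_m^T
    return K - K @ one - one @ K + one @ K @ one

# Placeholder data (assumption): 7 samples, 3 variables, linear kernel.
rng = np.random.default_rng(3)
X = rng.normal(size=(7, 3))
K = X @ X.T
K_tilde = center_gram(K)

# With a linear kernel the feature map is the identity, so centering is explicit.
X_centered = X - X.mean(axis=0)
K_explicit = X_centered @ X_centered.T
```

For nonlinear kernels the explicit route is not available, which is why the matrix expression is the one used in practice.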



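In $\mathcal{H}_k$ the covariance matrix takes the form

$$\tilde{C} = \frac{1}{m} \sum_{j=1}^{m} \tilde{\phi}(x_j)\, \tilde{\phi}(x_j)^T.$$

We have to find eigenvalues $\lambda \ge 0$ and nonzero eigenvectors
$\tilde{V} \in \mathcal{H}_k \setminus \{0\}$ satisfying

$$\tilde{C} \tilde{V} = \lambda \tilde{V}. \qquad (4)$$

To find the solutions of (4) we solve the dual eigenvalue
problem

$$\tilde{K} \tilde{\alpha} = m \lambda \tilde{\alpha}, \qquad (5)$$

with $\tilde{\alpha}$ being the expansion coefficients of an eigenvector
(in $\mathcal{H}_k$) in terms of the centered points (3):

$$\tilde{V} = \sum_{i=1}^{m} \tilde{\alpha}_i\, \tilde{\phi}(x_i). \qquad (6)$$

The solutions $\tilde{\alpha}^k$, $k = 1, \ldots, r$, are normalized by normalizing
the corresponding vectors $\tilde{V}^k$ in $\mathcal{H}_k$, which translates
into $\lambda_k \langle \tilde{\alpha}^k, \tilde{\alpha}^k \rangle = 1$, where $\lambda_k$ here denotes the
$k$-th eigenvalue of $\tilde{K}$ (that is, $m$ times the corresponding
eigenvalue in (4)).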

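Consider a test point $y$. To find its coordinates we
compute projections of the centered $\phi$-image of $y$ onto the
eigenvectors of the covariance matrix of the centered
points,

$$\langle \tilde{\phi}(y), \tilde{V}^k \rangle = \sum_{i=1}^{m} \tilde{\alpha}_i^k \langle \tilde{\phi}(y), \tilde{\phi}(x_i) \rangle = \sum_{i=1}^{m} \tilde{\alpha}_i^k \langle \phi(y) - \bar{\phi},\, \phi(x_i) - \bar{\phi} \rangle$$

$$= \sum_{i=1}^{m} \tilde{\alpha}_i^k \left[ K(y, x_i) - \frac{1}{m} \sum_{s=1}^{m} K(x_i, x_s) - \frac{1}{m} \sum_{s=1}^{m} K(y, x_s) + \frac{1}{m^2} \sum_{s,t=1}^{m} K(x_s, x_t) \right].$$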

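Introducing the vector

$$Z = \big( K(y, x_i) \big)_{m \times 1}, \qquad (7)$$

we obtain

$$\big( \langle \tilde{\phi}(y), \tilde{V}^k \rangle \big)_{1 \times r} = \left( Z^T - \frac{1}{m} \mathbf{1}_m^T K \right) \left( I_m - \frac{1}{m} \mathbf{1}_m \mathbf{1}_m^T \right) \tilde{V}, \qquad (8)$$

where $\tilde{V}$ is an $m \times r$ matrix whose columns are the
normalized coefficient vectors $\tilde{\alpha}^1, \ldots, \tilde{\alpha}^r$ of the
eigenvectors $\tilde{V}^1, \ldots, \tilde{V}^r$ (see (6)).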
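
The dual eigenproblem (5), the normalization and the projection (8) can be sketched together as follows. This is an illustrative Python/NumPy sketch, not the authors' code: the function names (`rbf`, `kpca_fit`, `kpca_project`), the random placeholder data and the kernel parameter are assumptions. As a consistency check, projecting the training points with (8) should reproduce the centered Gram matrix multiplied by the coefficient matrix.

```python
import numpy as np

def rbf(A, B, c):
    """Gaussian RBF kernel matrix between the rows of A and the rows of B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-c * d2)

def kpca_fit(X, c, r):
    """Solve the dual eigenproblem and normalize the expansion coefficients."""
    m = X.shape[0]
    K = rbf(X, X, c)
    H = np.eye(m) - np.ones((m, m)) / m
    K_tilde = H @ K @ H
    mu, U = np.linalg.eigh(K_tilde)            # mu plays the role of m * lambda in (5)
    order = np.argsort(mu)[::-1][:r]
    alpha = U[:, order] / np.sqrt(mu[order])   # so that alpha^T K_tilde alpha = I
    return K, H, K_tilde, alpha

def kpca_project(Y, X, K, H, alpha, c):
    """Coordinates of the rows of Y, following the matrix form (8)."""
    Z = rbf(Y, X, c)                           # row i holds Z^T for the point y_i
    return (Z - K.mean(axis=0)) @ H @ alpha    # K.mean(axis=0) equals (1/m) 1^T K

# Placeholder data (assumption): 10 training points and 3 test points.
rng = np.random.default_rng(5)
X = rng.normal(size=(10, 4))
Y = rng.normal(size=(3, 4))
K, H, K_tilde, alpha = kpca_fit(X, c=0.2, r=2)

train_coords = kpca_project(X, X, K, H, alpha, c=0.2)
test_coords = kpca_project(Y, X, K, H, alpha, c=0.2)
```

Dividing each unit-length eigenvector of the centered Gram matrix by the square root of its eigenvalue is what makes the corresponding feature-space eigenvector have unit norm.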